2 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
Yesterday we covered the following topics:
Today we cover the following topics:
The linear regression model:
\[y_i=\beta_0+\sum_{j}\beta_{j} x_{ij}+\varepsilon_i, \ \ \ \ \ \ \varepsilon_i\sim N(0, \sigma^2)\] where
\(y_i\) is score of individual \(i\) on the numeric dependent variable \(Y\)
\(x_{ij}\) is the score of individual \(i\) on predictor \(X_j\)
\(\beta_0\) is the intercept
\(\beta_j\) is the slope of \(X_j\)
\(\varepsilon_{i}\) is the residual (prediction error)
lm() function

| formula | model |
|---|---|
| y ~ 1 | intercept-only |
| y ~ x1 | main effect of x1 |
| y ~ x1 + x2 | main effects of x1, x2 |
| y ~ . | main effects of all predictors |
| y ~ . - x1 | main effects of all predictors except x1 |
| y ~ x1 + x2 + x1:x2 | main effects + interaction between x1 and x2 |
| y ~ x1*x2 | idem |
| y ~ .^2 | main effects + pairwise interactions between all predictors |
\[y_i=\hat{y}_i+\varepsilon_i\] where the fitted value is the intercept plus the slope times \(x_i\):
\[\hat{y}_i=\beta_0 + \beta_1x_i\]
mpg ~ disp
Interpretation of the parameter estimates.
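The objects inspected on the following slides can be reproduced with a short sketch on the built-in mtcars data (the call shown in the output confirms this data set):

```r
# Fit a simple linear regression of fuel efficiency on displacement
fit <- lm(mpg ~ disp, data = mtcars)

# Inspect the full structure of the fitted lm object (a list of 12)
str(fit)

# summary() returns a separate summary.lm object (a list of 11)
str(summary(fit))
```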
lm object
List of 12
$ coefficients : Named num [1:2] 29.5999 -0.0412
..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
$ residuals : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ effects : Named num [1:32] -113.65 -28.44 -1.79 2.65 3.92 ...
..- attr(*, "names")= chr [1:32] "(Intercept)" "disp" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:32] 23 23 25.1 19 14.8 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
.. .. ..$ : chr [1:2] "(Intercept)" "disp"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.18 1.09
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 30
$ xlevels : Named list()
$ call : language lm(formula = mpg ~ disp, data = mtcars)
$ terms :Classes 'terms', 'formula' language mpg ~ disp
.. ..- attr(*, "variables")= language list(mpg, disp)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "mpg" "disp"
.. .. .. ..$ : chr "disp"
.. ..- attr(*, "term.labels")= chr "disp"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(mpg, disp)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
$ model :'data.frame': 32 obs. of 2 variables:
..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
..$ disp: num [1:32] 160 160 108 258 360 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ disp
.. .. ..- attr(*, "variables")= language list(mpg, disp)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "mpg" "disp"
.. .. .. .. ..$ : chr "disp"
.. .. ..- attr(*, "term.labels")= chr "disp"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(mpg, disp)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
- attr(*, "class")= chr "lm"
lm object
Call:
lm(formula = mpg ~ disp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.8922 -2.2022 -0.9631 1.6272 7.2305
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
disp -0.041215 0.004712 -8.747 9.38e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
lm object
List of 11
$ call : language lm(formula = mpg ~ disp, data = mtcars)
$ terms :Classes 'terms', 'formula' language mpg ~ disp
.. ..- attr(*, "variables")= language list(mpg, disp)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "mpg" "disp"
.. .. .. ..$ : chr "disp"
.. ..- attr(*, "term.labels")= chr "disp"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(mpg, disp)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
$ residuals : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ coefficients : num [1:2, 1:4] 29.59985 -0.04122 1.22972 0.00471 24.07041 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "disp"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
$ sigma : num 3.25
$ df : int [1:3] 2 30 2
$ r.squared : num 0.718
$ adj.r.squared: num 0.709
$ fstatistic : Named num [1:3] 76.5 1 30
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:2, 1:2] 1.43e-01 -4.85e-04 -4.85e-04 2.10e-06
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "disp"
.. ..$ : chr [1:2] "(Intercept)" "disp"
- attr(*, "class")= chr "summary.lm"
lm list elements:

| Function / Subsetting | Output |
|---|---|
| coef(fit) / fit$coef | coefficients |
| fitted(fit) / fit$fitted | fitted values |
| resid(fit) / fit$resid | residuals |
| summary(fit)$r.squared | R-squared statistic |
Categorical predictors are converted into dummy variables:
each category has a dummy with value 1 for that category, and 0 otherwise
except for the reference category (0 on all dummies)
all categories are compared to the reference category
Reference category of \(z\) is \(a\)
Model for categorical \(Z\) with categories \(a, b, c\):
\[\hat{y}=\beta_0+\beta_1 z_b+\beta_2 z_c\]
| parameters | interpretation |
|---|---|
| \(\beta_0\) | predicted mean category \(a\) (reference category) |
| \(\beta_0+\beta_1\) | predicted mean category \(b\) |
| \(\beta_0+\beta_2\) | predicted mean category \(c\) |
Interpret the parameter estimates of model mpg ~ factor(am)
am = 0 is automatic and am = 1 is manual transmission
reference category is am = 0
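As a sketch, the exercise can be checked in R; the comments state the interpretations that follow from the dummy coding above:

```r
# am is dummy coded: 0 = automatic (reference category), 1 = manual
fit <- lm(mpg ~ factor(am), data = mtcars)
coef(fit)
# (Intercept): predicted mean mpg for automatic cars (am = 0)
# factor(am)1: difference in mean mpg between manual and automatic cars
```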
Call:
lm(formula = age ~ reg)
Coefficients:
(Intercept) regeast regwest regsouth regcity
14.7109 -0.8168 -0.7044 -0.6913 -0.6663
Group.1 x
1 north 14.71094
2 east 13.89410
3 west 14.00657
4 south 14.01965
5 city 14.04460
Call:
lm(formula = age ~ reg)
Residuals:
Min 1Q Median 3Q Max
-5.8519 -2.5301 0.0254 2.2274 6.2614
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7109 0.5660 25.993 <2e-16 ***
regeast -0.8168 0.7150 -1.142 0.255
regwest -0.7044 0.6970 -1.011 0.313
regsouth -0.6913 0.6970 -0.992 0.322
regcity -0.6663 0.9038 -0.737 0.462
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.151 on 218 degrees of freedom
Multiple R-squared: 0.006745, Adjusted R-squared: -0.01148
F-statistic: 0.3701 on 4 and 218 DF, p-value: 0.8298
Analysis of Variance Table
Response: age
Df Sum Sq Mean Sq F value Pr(>F)
reg 4 14.7 3.6747 0.3701 0.8298
Residuals 218 2164.6 9.9293
It is not a very informative model: the ANOVA is not significant, indicating that the residual variation outweighs the variation explained by the model.
The outcome age does not change significantly when reg is varied.
(Intercept) regeast regwest regsouth regcity
1 1 0 0 0 0
2 1 0 0 1 0
3 1 0 0 1 0
4 1 1 0 0 0
5 1 0 1 0 0
6 1 1 0 0 0
7 1 0 0 0 0
8 1 0 1 0 0
9 1 0 0 1 0
10 1 0 1 0 0
R expands the categorical variable for us:
5 categories become 4 dummies (plus an intercept).
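The dummy expansion can be inspected directly through the design matrix. A sketch, assuming reg comes from the boys data in the mice package (an assumption; the slides do not name the package here):

```r
library(mice)                 # assumed source of the boys data
boys_cc <- na.omit(boys)
# model.matrix() shows the intercept plus the 4 dummies R creates for reg
head(model.matrix(~ reg, data = boys_cc), 10)
```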
Call:
aov(formula = .)
Residuals:
Min 1Q Median 3Q Max
-5.8519 -2.5301 0.0254 2.2274 6.2614
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7109 0.5660 25.993 <2e-16 ***
regeast -0.8168 0.7150 -1.142 0.255
regwest -0.7044 0.6970 -1.011 0.313
regsouth -0.6913 0.6970 -0.992 0.322
regcity -0.6663 0.9038 -0.737 0.462
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.151 on 218 degrees of freedom
Multiple R-squared: 0.006745, Adjusted R-squared: -0.01148
F-statistic: 0.3701 on 4 and 218 DF, p-value: 0.8298
Without adjustment of the p-values
With p-values adjusted via the Bonferroni correction
Manually calculated
(Intercept) regeast regwest regsouth regcity
5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
If you have trouble reading scientific notation, 5.077098e-68 means the following:
\[5.077098\text{e-68} = 5.077098 \times 10^{-68} = 5.077098 \times \left(\frac{1}{10}\right)^{68}\]
This indicates that the decimal point should be moved 68 places to the left:
\[5.077098\text{e-68} = 0.\underbrace{000\dots000}_{67\text{ zeros}}5077098\]
Akaike’s An Information Criterion
AIC comes from information theory and can be used for model selection. The AIC estimates the information that is lost when the statistical model is used to approximate the true data-generating process: the lower the AIC, the less information is lost. In other words: AIC measures the relative fit of the model to the data.
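As a sketch of AIC-based model comparison on the built-in mtcars data (the variable choice here is illustrative, not taken from the slides):

```r
fit1 <- lm(mpg ~ disp, data = mtcars)
fit2 <- lm(mpg ~ disp + wt, data = mtcars)
# Lower AIC indicates less information loss, i.e. a better relative fit
AIC(fit1, fit2)
```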
Let’s add predictor hgt to the model:
Let’s add wgt to the model
Let’s add wgt and the interaction between wgt and hgt to the model
is equivalent to
Analysis of Variance Table
Model 1: age ~ reg
Model 2: age ~ reg + hgt
Model 3: age ~ reg + hgt + wgt
Model 4: age ~ reg + hgt * wgt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 218 2164.59
2 217 521.64 1 1642.94 731.8311 < 2.2e-16 ***
3 216 485.66 1 35.98 16.0276 8.595e-05 ***
4 215 482.67 1 2.99 1.3324 0.2497
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
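A comparison table like the one above is produced by passing the nested fits to anova(). A sketch with hypothetical object names boys.fit1 through boys.fit4 matching the models listed, and assuming the boys data from the mice package:

```r
library(mice)                                      # assumed source of the boys data
boys_cc <- na.omit(boys)
boys.fit1 <- lm(age ~ reg,             data = boys_cc)
boys.fit2 <- lm(age ~ reg + hgt,       data = boys_cc)
boys.fit3 <- lm(age ~ reg + hgt + wgt, data = boys_cc)
boys.fit4 <- lm(age ~ reg + hgt * wgt, data = boys_cc)
# Each row tests whether the added term(s) significantly reduce the RSS
anova(boys.fit1, boys.fit2, boys.fit3, boys.fit4)
```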
boys.fit3
Analysis of Variance Table
Response: age
Df Sum Sq Mean Sq F value Pr(>F)
reg 4 14.70 3.67 1.6343 0.1667
hgt 1 1642.94 1642.94 730.7065 < 2.2e-16 ***
wgt 1 35.98 35.98 16.0029 8.688e-05 ***
Residuals 216 485.66 2.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boys.fit4
Analysis of Variance Table
Response: age
Df Sum Sq Mean Sq F value Pr(>F)
reg 4 14.70 3.67 1.6368 0.1661
hgt 1 1642.94 1642.94 731.8311 < 2.2e-16 ***
wgt 1 35.98 35.98 16.0276 8.595e-05 ***
hgt:wgt 1 2.99 2.99 1.3324 0.2497
Residuals 215 482.67 2.24
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that reg and the interaction hgt:wgt are redundant
Let’s revisit the comparison, now dropping reg from the largest model
Analysis of Variance Table
Model 1: age ~ reg
Model 2: age ~ reg + hgt
Model 3: age ~ reg + hgt + wgt
Model 4: age ~ hgt + wgt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 218 2164.59
2 217 521.64 1 1642.94 730.7065 < 2.2e-16 ***
3 216 485.66 1 35.98 16.0029 8.688e-05 ***
4 220 492.43 -4 -6.77 0.7529 0.5571
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The boys.fit5 model fits about as well as the previous model while using fewer parameters, which makes it the better choice.
We start with the full model, which contains all parameters for all columns.
The most straightforward way to go about this is by specifying the following model:
Call:
lm(formula = age ~ ., data = na.omit(boys))
Coefficients:
(Intercept) hgt wgt bmi hc gen.L
2.556051 0.059987 -0.009846 0.142162 -0.024086 1.287455
gen.Q gen.C gen^4 phb.L phb.Q phb.C
-0.006861 -0.187256 0.034186 1.552398 0.499620 0.656277
phb^4 phb^5 tv regeast regwest regsouth
-0.094722 -0.113686 0.074321 -0.222249 -0.233307 -0.258771
regcity
0.188423
We can then start with specifying the stepwise model. In this case we choose direction both.
Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
Coefficients:
(Intercept) hgt bmi phb.L phb.Q phb.C
2.07085 0.05482 0.10742 2.70328 0.28877 0.48492
phb^4 phb^5 tv
-0.06225 -0.15167 0.07957
Other options are
forward: fit all univariate models, add the best predictor and continue.
backward: fit the full model, eliminate the worst predictor and continue.
Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
Residuals:
Min 1Q Median 3Q Max
-3.5077 -0.7144 -0.0814 0.6850 3.1724
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.070854 1.576365 1.314 0.19036
hgt 0.054822 0.009986 5.490 1.13e-07 ***
bmi 0.107415 0.030135 3.564 0.00045 ***
phb.L 2.703275 0.437108 6.184 3.12e-09 ***
phb.Q 0.288768 0.211836 1.363 0.17426
phb.C 0.484921 0.202856 2.390 0.01769 *
phb^4 -0.062246 0.208472 -0.299 0.76555
phb^5 -0.151667 0.240599 -0.630 0.52912
tv 0.079573 0.019600 4.060 6.89e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.172 on 214 degrees of freedom
Multiple R-squared: 0.8651, Adjusted R-squared: 0.86
F-statistic: 171.5 on 8 and 214 DF, p-value: < 2.2e-16
full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both",
trace = FALSE)
step.model
Call:
lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
Coefficients:
(Intercept) hgt bmi phb.L phb.Q phb.C
2.07085 0.05482 0.10742 2.70328 0.28877 0.48492
phb^4 phb^5 tv
-0.06225 -0.15167 0.07957
DfBeta calculates the change in each coefficient upon deletion of an observation, expressed as a deviation in SEs.
(Intercept) hgt bmi phb.L phb.Q
3279 -0.142127090 0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
3283 -0.005201914 0.0001127452 -0.0014100750 0.009068652 -0.0147293625
3296 -0.081439010 0.0004244061 0.0002603379 -0.009247562 -0.0071293767
3321 -0.084630826 0.0004186923 0.0008222594 -0.005498486 -0.0059276580
3323 0.154386486 -0.0004476521 -0.0052964367 0.042872835 -0.0213761347
3327 -0.046095468 0.0001081361 0.0011150546 -0.004047875 -0.0085461280
3357 -0.045933397 0.0001345605 0.0009770371 -0.005087053 -0.0062604484
phb.C phb^4 phb^5 tv
3279 0.007427067 -0.004226982 -0.001213833 0.0001061106
3283 0.011761595 -0.005498411 0.001164307 0.0007051901
3296 0.008538302 -0.003688611 0.000842872 0.0002494383
3321 0.006839217 -0.003341741 0.000866896 -0.0002569914
3323 0.011758694 -0.006659755 0.000493897 0.0012317886
3327 0.007807966 -0.002901270 0.001499505 0.0003376220
3357 0.006152384 -0.002274777 0.001148552 0.0002329346
Let’s use the simpler anscombe data example
The residual is then calculated as the difference between the observed and the fitted value: \(\varepsilon_i = y_i - \hat{y}_i\).
If we introduce new values for the predictor x1, we can generate predicted values from the model
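The predicted values shown below match a fit of y1 on x1 from the built-in anscombe data, with new predictor values 1 through 20; a sketch:

```r
# Fit the first Anscombe pair
fit <- lm(y1 ~ x1, data = anscombe)
# Generate predicted values for new predictor values x1 = 1, ..., 20
predict(fit, newdata = data.frame(x1 = 1:20))
```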
1 2 3 4 5 6 7 8
3.500182 4.000273 4.500364 5.000455 5.500545 6.000636 6.500727 7.000818
9 10 11 12 13 14 15 16
7.500909 8.001000 8.501091 9.001182 9.501273 10.001364 10.501455 11.001545
17 18 19 20
11.501636 12.001727 12.501818 13.001909
fit lwr upr
1 8.001000 5.067072 10.934928
2 7.000818 4.066890 9.934747
3 9.501273 6.390801 12.611745
4 7.500909 4.579129 10.422689
5 8.501091 5.531014 11.471168
6 10.001364 6.789620 13.213107
7 6.000636 2.971271 9.030002
8 5.000455 1.788711 8.212198
9 9.001182 5.971816 12.030547
10 6.500727 3.530650 9.470804
11 5.500545 2.390073 8.611017
A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
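Both interval types are available through the interval argument of predict(); a sketch on the anscombe fit:

```r
fit <- lm(y1 ~ x1, data = anscombe)
new <- data.frame(x1 = c(4, 10, 13))
# Prediction interval: uncertainty around a single new observation (wider)
predict(fit, newdata = new, interval = "prediction")
# Confidence interval: uncertainty around the mean prediction (narrower)
predict(fit, newdata = new, interval = "confidence")
```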
If we had not used na.omit()
If an infinite number of samples were drawn and CI’s computed, then the true population mean \(\mu\) would be in at least 95% of these intervals
\[95\%~\text{CI}=\bar{x} \pm z_{(1-\alpha/2)} \cdot \text{SEM}\]
Example
Neyman, J. (1934). On the Two Different Aspects of the Representative Method:
The Method of Stratified Sampling and the Method of Purposive Selection.
Journal of the Royal Statistical Society, Vol. 97, No. 4 (1934), pp. 558-625
Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.
that there is a 95% probability the population parameter lies within the interval
that there is a 95% probability that the interval covers the population parameter
Once an experiment is done and an interval is calculated, the interval either covers, or does not cover the parameter value. Probability is no longer involved.
The 95% probability only has to do with the estimation procedure.
100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of 100 samples, only 5 samples have confidence intervals that do not cover the population mean.
At this point we have covered the following models:
\[y=\beta_0+\beta_1x+\varepsilon\]
The relationship between a numerical outcome and a numerical or categorical predictor
\[y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon\]
The relationship between a numerical outcome and multiple numerical or categorical predictors
We have not yet covered how to handle outcomes that are not categorical or how to deal with predictors that are nonlinear or have a strict dependency structure.
We have covered the following topics last week:
Instead of modeling
\[y=\beta_0+\beta_1x+\varepsilon\]
we can also consider \[\mathbb{E}[y] = \beta_0+\beta_1x\]
They’re the same. Different notation, different framework.
The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
This is what we will be doing today
simulated data set
To further illustrate why the linear model is not an appropriate model for discrete data, I propose the following simple simulated data set:
set.seed(123)
simulated <- data.frame(discrete = c(rep(0, 50), rep(1, 50)),
continuous = c(rnorm(50, 10, 3), rnorm(50, 15, 3)))
simulated %>% summary
    discrete     continuous
Min. :0.0 Min. : 4.100
1st Qu.:0.0 1st Qu.: 9.656
Median :0.5 Median :12.904
Mean :0.5 Mean :12.771
3rd Qu.:1.0 3rd Qu.:15.570
Max. :1.0 Max. :21.562
This data allows us to illustrate modeling the relation between the discrete outcome and the continuous predictor with logistic regression.
Remember that fixing the random seed allows for a replicable random number generator sequence.
simulated data
simulated with lm
The orange line represents the lm linear regression line. It is not a good representation of our data, as it assumes the data are continuous and projects values outside the range of the observed data.
simulated with glm
The blue glm logistic regression line represents this data infinitely better than the orange lm line. It assumes the data to be 0 or 1 and does not project values outside of the range of the observed data.
There is a very general way of addressing this type of problem in regression. The models that use this general way are called generalized linear models (GLMs).
Every generalized linear model has the following three characteristics:
1. A probability distribution that describes the outcome
2. A linear predictor model
3. A link function that relates the linear predictor to the parameter of the outcome distribution
The linear predictor model in (2) is \[\eta = \bf{X}\beta\] where \(\eta\) denotes a linear predictor and the link function in (3) is \[\bf{X}\beta = g(\mu)\] The technique to model a binary outcome based on a set of continuous or discrete predictors is called logistic regression. Logistic regression is an example of a generalized linear model.
The link function for logistic regression is the logit link
\[\bf{X}\beta = ln(\frac{\mu}{1 - \mu})\]
where \[\mu = \frac{\text{exp}(\bf{X}\beta)}{1 + \text{exp}(\bf{X}\beta)} = \frac{1}{1 + \text{exp}(-\bf{X}\beta)}\]
Before we continue with discussing the link function, we are first going to dive into the concept of odds.
Properly understanding odds is necessary to perform and interpret logistic regression, as the logit link is connected to the odds
Odds are a way of quantifying the probability of an event \(E\)
The odds for an event \(E\) are \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)}\] The odds of getting heads in a coin toss is
\[\text{odds}(\text{heads}) = \frac{P(\text{heads})}{P(\text{tails})} = \frac{P(\text{heads})}{1 - P(\text{heads})}\] For a fair coin, this would result in
\[\text{odds}(\text{heads}) = \frac{.5}{1 - .5} = 1\]
The game Lingo has 44 balls: 36 blue, 6 red and 2 green balls
Odds of 1 indicate an equal likelihood of the event occurring or not occurring. Odds < 1 indicate a lower likelihood of the event occurring vs. not occurring. Odds > 1 indicate a higher likelihood of the event occurring.
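For example, the odds of drawing a blue ball in Lingo follow directly from the definition:

```r
p_blue <- 36 / 44                   # 36 blue balls out of 44 in total
odds_blue <- p_blue / (1 - p_blue)  # P(E) / P(E^c)
odds_blue                           # 36 / 8 = 4.5: blue is 4.5 times as likely as not-blue
```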
Remember that \[y=\beta_0+\beta_1x+\varepsilon,\]
and that \[\mathbb{E}[y] = \beta_0+\beta_1x.\]
As a result \[y = \mathbb{E}[y] + \varepsilon.\]
and residuals do not need to be normal (heck, \(y\) probably isn’t, so why should \(\epsilon\) be?)
Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.
In logistic regression we assume that the true data generating model for the outcome variable follows a binomial distribution.
We specify a reasonable link that connects \(\eta\) to \(p\). Most common in logistic regression is the logit link
\[\text{logit}(p)=\text{log}\left(\frac{p}{1-p}\right), \text{ for } 0 \leq p \leq 1\] We might recognize \(\frac{p}{1-p}\) as the odds.
Now if we visualize the relation between our predictor(s) and the logodds
And the relation between our predictor(s) and the probability
With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).
With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.
The likelihood is the (joint) probability of the observed values, given the current model parameters.
In normally distributed data: \(\text{SS}=\text{D}\).
Remember the three characteristics for every generalized linear model:
For the logistic model this gives us:
Simple substitution brings us at
\[p_i = \frac{\text{exp}(\eta)}{1+\text{exp}(\eta)} = \frac{\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}{1+\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}\]
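In R, the logit link and its inverse are available as qlogis() and plogis(), so the substitution above can be checked numerically:

```r
p <- 0.75
eta <- qlogis(p)   # logit: log(p / (1 - p))
plogis(eta)        # inverse logit: exp(eta) / (1 + exp(eta)), recovers 0.75
```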
anesthetic data
   move conc   logconc nomove
1 0 1.0 0.0000000 1
2 1 1.2 0.1823216 0
3 0 1.4 0.3364722 1
4 1 1.4 0.3364722 0
5 1 1.2 0.1823216 0
6 0 2.5 0.9162907 1
7 0 1.6 0.4700036 1
8 1 0.8 -0.2231436 0
9 0 1.6 0.4700036 1
10 1 1.4 0.3364722 0
Thirty patients were given an anesthetic agent maintained at a predetermined level (conc) for 15 minutes before making an incision. It was then noted whether the patient moved, i.e. jerked or twisted.
Fitting a glm in R is not much different from fitting a lm. We do, however, need to specify what type of glm to use by specifying both the family and the type of link function we need.
For logistic regression we need the binomial family as the binomial distribution is the probability distribution that describes our outcome. We also use the logit link, which is the default for the binomial glm family.
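The output on the next slides can be reproduced with a call like the following, assuming the anesthetic data come from the DAAG package (as in the cross-validation example at the end):

```r
library(DAAG)    # assumed source of the anesthetic data
fit <- glm(nomove ~ conc, family = binomial(link = "logit"),
           data = anesthetic)
fit              # brief print: coefficients and deviances
summary(fit)     # full summary: Wald z-tests per coefficient
```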
Call: glm(formula = nomove ~ conc, family = binomial(link = "logit"))
Coefficients:
(Intercept) conc
-6.469 5.567
Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
Null Deviance: 41.46
Residual Deviance: 27.75 AIC: 31.75
Call:
glm(formula = nomove ~ conc, family = binomial(link = "logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.469 2.418 -2.675 0.00748 **
conc 5.567 2.044 2.724 0.00645 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 41.455 on 29 degrees of freedom
Residual deviance: 27.754 on 28 degrees of freedom
AIC: 31.754
Number of Fisher Scoring iterations: 5
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.468675 2.418470 -2.674697 0.007479691
conc 5.566762 2.043591 2.724010 0.006449448
With every unit increase in concentration conc, the log odds of not moving increases by 5.5667617. This increase can be considered different from zero, as the p-value is 0.0064494.
In other words; an increase in conc will lower the probability of moving. We can verify this by modeling move instead of nomove:
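Since move = 1 - nomove, flipping the outcome exactly negates the coefficients; a sketch, again assuming the DAAG anesthetic data:

```r
fit.move <- glm(move ~ conc, family = binomial(link = "logit"),
                data = DAAG::anesthetic)
coef(fit.move)  # the slope for conc is now negative:
                # higher concentrations lower the log odds of moving
```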
library(caret)
pred <- fit %>% predict(type = "response")
confusionMatrix(data = factor(as.numeric(pred > 0.5), labels = c("move", "nomove")),
reference = factor(anesthetic$nomove, labels = c("move", "nomove")))
Confusion Matrix and Statistics
Reference
Prediction move nomove
move 10 2
nomove 4 14
Accuracy : 0.8
95% CI : (0.6143, 0.9229)
No Information Rate : 0.5333
P-Value [Acc > NIR] : 0.002316
Kappa : 0.5946
Mcnemar's Test P-Value : 0.683091
Sensitivity : 0.7143
Specificity : 0.8750
Pos Pred Value : 0.8333
Neg Pred Value : 0.7778
Prevalence : 0.4667
Detection Rate : 0.3333
Detection Prevalence : 0.4000
Balanced Accuracy : 0.7946
'Positive' Class : move
With the error of the model (or lack thereof) comes a problem: how do we know that a seemingly better model is not merely an artifact of our particular data?
Training
If the model will only fit well on the data it has been trained on, then we are overfitting. We have then successfully modeled not only the data, but also (much of) the noise.
A great example is the Library of Babel. It contains every phrase, page, etc. that will ever be written in English. However, it is a most inefficient way of writing beautiful literature.
Testing
To avoid overfitting we can train the model on one data set and test its performance on another (separate) data set that comes from the same true data-generating model.
Validation
If we are also optimizing hyperparameters, then the in-between-step of validation makes sense. You then train the initial model on one data set, validate its optimization on another data set and finally test its performance on the last data set.
Collecting multiple independent data sets to realize a true train/validate/test approach is a costly endeavor that takes up a lot of time and resources.
Alternative: splitting the observed data
We can randomly split the data into 2 parts: one part for training and one part for testing
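A random split can be sketched as follows; the 70/30 ratio is a common choice, not one prescribed by the slides:

```r
set.seed(123)                                  # replicable split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # ~70% of rows for training
train <- mtcars[train_idx, ]                   # used to fit the model
test  <- mtcars[-train_idx, ]                  # held out to evaluate it
```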
Solution: K-fold cross-validation
Partition the data into \(K\) folds and use each fold once as the test set and the remaining \(K-1\) folds as the training set.
set.seed(123)
library(caret)
# define training control
train_control <- trainControl(method = "cv", number = 3, savePredictions = TRUE)
# train the model on training set
model <- train(as.factor(move) ~ conc,
data = DAAG::anesthetic,
trControl = train_control,
method = "glm",
family = binomial(link = "logit"))
# print cv scores
model$results
  parameter  Accuracy     Kappa AccuracySD  KappaSD
1 none 0.7919192 0.5737505 0.121414 0.253953
Gerko Vink @ Anton de Kom Universiteit, Paramaribo